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Abstract 

Variable selection techniques have become increasingly popular amongst statisticians 
due to an increased number of regression and classification applications involving high- 
dimensional data where we expect some predictors to be unimportant. In this context, 
Bayesian variable selection techniques involving Markov chain Monte Carlo exploration of 
the posterior distribution over models can be prohibitively computationally expensive and 
so there has been attention paid to quasi-Bayesian approaches such as maximum a poste- 
riori (MAP) estimation using priors that induce sparsity in such estimates. We focus on 
this latter approach, expanding on the hierarchies proposed to date to provide a Bayesian 
interpretation and generalization of state-of-the-art penalized optimization approaches and 
providing simultaneously a natural way to include prior information about parameters within 
this framework. We give examples of how to use this hierarchy to compute MAP estimates 
for linear and logistic regression as well as sparse precision-matrix estimates in Gaussian 
graphical models. In addition, an adaptive group lasso method is derived using the frame- 
work. 



1 Introduction 



There has been recent interest in sparse estimates for coefficients in regression problems, with 
this problem often termed variable selection in the literature. To this end, a variety of ap- 
proaches have been proposed in both the statistics and signal processing literatures. Most of 
the computationally tractable approaches are the solutions of penalized optimization problems 
associated with regularization of the coefficients in likelihood optimization. Although not truly 
Bayesian approaches, often they can be interpreted as maximum a posteriori (MAP) estimates 
associated with the posterior density of the coefficients where the prior induces the regularization 
used in the optimization routine. 

Denoting the coefficients by (3 £ W, a popular family of these computing estimates as solutions 
to penalized optimization problems involving the log-likelihood of the data given j3 and £ q 
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penalization on the coefficients with < q < 1. When q is in this range, the solutions are sparse 
for large enough values of multiplicative penalization weights. When q > 1, the penalization is 
additionally convex, a property that has made the choice of q = 1 particularly suitable when 
the log-likelihood is concave as this leads to a unique global maxima for the objective function. 

Beginning with pQ, it has become popular practice to use ^i-regularization on each component 
of (3. However, use of identical penalization on each coefficient, e.g. A][^ =1 \ j3j\ can lead to 
unacceptable bias in the resulting estimates [2], which has motivated use of sparsity-inducing 
non-convex penalties despite the increased difficulty in computing the resulting estimates. In 
particular, this has led to the adoption of "adaptive" methods [21 H] in the statistics literature 
and iteratively reweighted methods [5, 6j in the signal processing literature. 

We propose a hierarchical prior for (3 that amounts marginally to a sparsity-inducing, non- 
convex penalty in MAP estimation. Further, the specific hierarchy gives rise to an expectation- 
maximization (EM) algorithm [7] that is essentially an iteratively reweighted ^-minimization 
algorithm. In one case, the algorithm corresponds to the iteratively reweighted ^i-minimization 
algorithm and has been independently suggested in both [8J and [9J. Our hierarchical formu- 
lation of the prior, in contrast, allows users to incorporate prior information about different 
coefficients and allows flexibility in grouping variables together. For example, the framework 
gives immediately an adaptive version of the group lasso algorithm proposed in |10| , 



2 The hierarchical adaptive lasso (HAL) 



We are interested in prior distributions for (3 in a general regression settings. Let = 
{1, . . . , k}. We are given n observations {yi}f=i an d associated with each observation a vector of 
covariates Xj E MP for i £T n . We assume that the conditional distribution of each m is indepen- 
dent given Xj and has density /(y|x, /3, 6), where (3 E W and 9 E parametrize the distribution 
of y conditional on x. Defining y = (yi, • ■ ■ , y n )' E W 1 and X = f (x' 1; . . . , x^)' E W nxp , the condi- 
tional distribution of all of the observations is then given by /(y|X, (3, 9) = Y\7=i /(y«l x ?; A 
We are primarily interested in the parameter (3 and assume that each component j3j has special 
meaning when equal to 0. 

While a Bayesian approach would usually suggest approximating the posterior density 

p({3\y,X,9)<xf(y\X,(3,9)p(l3\9), 

we focus here on MAP (point) estimates of f3 since these are computationally easier to compute, 
especially when f(y\~K, (3,9) and p((3\9) are concave, and their use is not uncommon when p 
and/or n are large. MAP estimates are computed by solving the optimization problem 

Pmap = argmax/(y|X,/MM/3|0) 

or, equivalently 

Pmap = argmaxlog/(y|X,/3,6l) + logp(/3|6>) 

The log p{(3\9) can be thought of as a penalization term in optimizing the log-likelihood of the 
data log /(y|X, 0,0). 
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2.1 Generalized t-distribution prior 



We propose a hierarchical approach to constructing priors for (3. At the lowest level, we give 

jf , ie. p((3\al p ) = 



each element /3j of (3 an independent normal prior with mean and variance a 2 -, ie. p((3\af. 



rij=i P(Pj\ a j ) where f3j\a"j ~ -/V(0, cr|). If we leave cj for j E T p as hyperparameters, computing 
the resulting MAP estimate corresponds to .^-penalized optimization of the log-likelihood. 

If, instead, we model each crj as being drawn from an exponential distribution with mean 2r? 
we obtain a double-exponential distribution for (3j after <rj has been integrated out, ie. 

M/3 J |r,) = 9 L exp(-^) 

Computing the MAP estimate associated with this prior corresponds to ^-penalized optimiza- 
tion and the solution itself is identical to the LASSO estimate when / is a multivariate Gaussian 
with mean X/3. This prior has become popular in recent years for variable selection since it 
induces sparsity in (3 map for small enough values of Tj. 

We propose adding another level of hierarchy to the prior by having separate random variables 
Tj for each j E T p and placing inverse-gamma priors on each Tj. Indeed, if we let Tj ~ IG(a,j, bj) 
we obtain 

after integrating out Tj, which we call the hierarchical adaptive lasso (HAL) prior since one can 
compute MAP estimates using this prior with a type of adaptive lasso algorithm, as can be seen 



in Section 2.2 This is the density of a generalized t-distribution. Computing the MAP estimate 
associated with this prior corresponds to logarithmic penalization of the log-likelihood. From a 
Bayesian modelling perspective, the introduction of a distribution over the r,- is a natural way to 
resolve the issue of believing that there are significant differences in the sizes of the coefficients 
of (3 that cannot be modelled as having come from a distribution with as thin tails as a Laplace 
distribution. 



2.2 Computing MAP estimates 

The optimization problem associated with the generalized t-distribution prior is not concave. 
However, one can find local modes of the posterior using the EM algorithm with the r = t\- p 
as latent variables. Indeed, each iteration of EM takes the form 

/3(* +1 ) = axgmaxlog/(y|X,/3,0) + j log[ P ((3\r)] P (T\f3^ ,a,b)dr 
The conjugacy of the inverse-gamma distribution with respect to the Laplace distribution gives 



Tj\pf\a„b^IG(a, + 1,6, + |&|) 



and withp(/3|r) = n?=iK&ki) = IT^i exp(-|/W^) yields 

/ 3(*+i) = argmaxlog/(y|X,A^)-El^l / -P^Wf ,a 3 ,bj)d7 
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where the expectation of 1/tj given tj ~ IG(aj + 1, bj + '\) is (cij + l)/(bj + |). 

As such, one can find a local mode of the posterior p(/3\y, X, j3, 9) by starting at some point 
/3(°) and then iteratively solving 

p 

p(t+i) = argm axlog/(y|X,/3,fl) -2>f|&| 

i=i 

where 

5 6i + |/3f| 

It is clear that for large enough values of a,j and small enough values of bj that the MAP 
estimates obtained by the EM algorithm are sparse. In fact, any posterior mode with this 
prior corresponds to a weighted lasso solution, which is sparse when the penalization through 
{(aj,bj)}^ =1 is large enough. 

2.2.1 Oracle properties 

In the penalized optimization literature, some methods are justified at least partially by their 
possession of the oracle property: that for appropriate parameter choices, the method performs 
just as well as an oracle procedure in terms of selecting the correct variables and estimating 
the nonzero coefficients asymptotically in n. Using the HAL prior in Theorem 5 of [3] gives us 
the oracle property if dj — > oo and re _1 / 2 aj — > as n — > oo. It is worth remarking that this 
property requires our prior on (3 to depend on the number of observations, which is atypical in 
Bayesian inference. Intuitively, a,j needs to increase as n increases to ensure that the solution 
remains sparse whilst it cannot increase too quickly or consistency is lost. As pointed out in 
this trade off is impossible to accomplish with the LASSO. 



2.3 Generalizations and extensions 



2.3.1 Exponential power family 



One can model /3j more generally as coming from an exponential power distribution instead of 
a Laplace distribution. In this case, we can write 

With an inverse-gamma prior on r/j, which enjoys conjugacy with respect to the exponential 
power distribution, we obtain 

p{Pj\aj,bj,q) = J - w + 1 

2r(oj)r(i + l/q)b) /q V bj 

Use of this prior results in the same algorithm but with the weights given by 

w (t) = aj + Vg 
3 bj + lpV]* 
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The use of an exponential power prior can be motivated hierarchically as a scale mixture of 
normal distributions for q £ [1,2) |llj or as a scale mixture of uniform distributions for q £ 
(l,oo) [12] . For q £ (0,1) this distribution is still defined but it does not have the same 
interpretation as when g > 1 and additionally has a non-concave density which complicates 
computation of posterior modes. The choice q = 2 corresponds to a normal distribution and 
after marginalizing out 77 it gives a scaled i-distribution with 2a j degrees of freedom and scale 
y^bjjaj. This choice leads to a hierarchical adaptive ^-regularized method that may be suitable 
for problems in which prediction instead of variable selection is more important. 

Contour plots of the negative log density of the joint prior for two variables are given in Figure 
[T] and thresholding plots associated with the priors are given in Figure [2] The contour plots 
show graphically how the LASSO and HAL approaches give sparse solutions whilst the hier- 
archical adaptive ridge (HAR) prior, corresponding to q = 2, gives non-sparse solutions. The 
thresholding plots show that whilst the LASSO significantly biases even large coefficients, the 
HAL and HAR do not. 




(a) LASSO (b) HAL (c) HAR 

Figure 1: Two-dimensional contour plots of the penalties, ie. the negative log density, associated 
with the priors. 




(a) LASSO (b) HAL (c) HAR 

Figure 2: Threshold plots associated with the priors. 



2.3.2 The hierarchical adaptive group lasso 

The hierarchical framework allows us to group variables together by making them dependent 
on a shared variable higher up in the hierarchy. For example, letting g : T p — > Tk be a function 
mapping variables to one of K groups and rii be the number of variables in group i we can use 
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the following model: 



^•K 2 (i )~iV(0,a s 2 (i) ),iGT p 



With = {j : g(j) = i}, this gives 



PWM = r((n i + l)/2) 6XP( ^ • 



and so a^, 6j ~ IG(a,i + Wj, 6j + y^jeG 4 l/%l 2 )- The corresponding iterative procedure is 

then 

if 

= arg max log / (y | X, /3, 0) - £ w|* +1) | 1 1 2 

i=l 

where 

(t+i) at + rtj 



The marginal prior on Po i has the density 

but this density is never evaluated in the EM algorithm. 

A related problem to grouped variable selection is known as multi-task learning within the 
machine learning literature, where one wants to solve for 9 = {/3^}f =1 in a variety of L related 
regression models. One approach is to solve the optimization problem 

L v 
6 map = arg max log /j (y» | Xj , /3 W ) + } ] Xj\\f3j\\ 2 

i=l j=l 

where f3j = {f^f \ ■ ■ ■ , £ ^ L [IB]- This type of regularization can be derived using the same 
hierarchical prior used in the group lasso where the coefficients relating to the same covariate 
are 'grouped' together to promote sparsity across the individual (3 estimates, ie. a covariate is 
selected in all the related models or in none of the models. As such, an adaptive version of this 
multi-task learning approach follows the same form as the hierarchical adaptive group lasso. 



2.3.3 Matrix priors 

For the purpose of covariance matrix estimation, ^i-regularization has been used on entries 
of the precision matrix Q of a Gaussian graphical model |14| [15] . This corresponds to MAP 
estimation using Laplace priors on each Qij for i < j. We can incorporate this type of prior 
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within our framework by placing inverse Gamma priors on the scale parameters of each Laplace 
distribution. We have 

p^li^nj) = ^exp(-^) 

'ij 'ij 

with p(Q\r) = Y[ I i=iY[ 1 j=iP(^'ij\ T ij) an d T~ij ~ IG(a,ij,bij). Note that in this formulation the 
prior on Q, is non-zero for non-positive-definite values. This allows us to specify 

v v 

p(r\n, AB) = YIY[ IGinj-aij + 1, bij + 

i=l j=i 

One can have the likelihood of observed data Y, p(Y\Q) be zero if fi is not symmetric positive- 
definite. In this case, the posterior and hence the MAP estimate are equivalent to the case 
where the prior takes the form 

i P (n) P (n\A,B) 
p{ 1 ' j " ji P (n) P (n\A,B)dn 

since in both cases we have 

l v (n)p(Y\Q)p(Q\A, B) 



p(Sl\Y,A,B) 



f l v (Q)p(Y\Q)p(n\A, B)dQ 

where V is the set of symmetric positive-definite matrices. Note, however, that the positive- 
definite prior cannot be used to derive the EM algorithm central to our methodology since 
r|f2, A, B is no longer a product of inverse-gamma distributions. 



2.3.4 The hierarchical lasso 



In some cases, one might be interested in having rjj = rj for all j £ T p with rj ~ IG(a,b). In 
this case, one obtains a prior on (3 of the form 

m h ^ na+vh) ( zum q , X a ~ p/q 

P(/3|fl ' *' Q) = 2PT(a)T{l + WIT* { ~ + l ) (2) 

which leads to the iterative procedure 

v 

/3(' +1 ) = argmaxlog/(y|X,/3,#) - £ |/3,f 

where 

w (t+i) = Q + P/g 

b + EU^ 

In fact, a more general prior can be constructed by considering groupings of the coefficients 
such that (3j ~ EP(rji,q) for all j G Gi, where Gi is again the set of indices of coefficients in 
group i. A prior constructed in this fashion leads to the iterative procedure 

K 

p( t+ V = argmaxlog/(y|X,/3,#) - E W 

i=l j&G, 

where 

(t+i) ai + rii/q 
w- = ■ 

^ + £ j6Gi l/fl« 
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2.3.5 Modifying the hierarchy 



The above examples are only a subset of the possible modifications to the hierarchy that are 
possible. Indeed, one of the benefits of a hierarchical approach is that one can flexibly group 
variables via the sharing of random variables. Graphical models for the exponential family gen- 
eralization and the grouped variable generalization are given in Figure [3] along with a discussion 



of their relationships to existing methods in Section 3.7 



2.4 Tuning the hyperparameters 

Use of the proposed framework relies on appropriate settings of the hyperparameters. For 
distributions of non-negative Z with density 

, , v- 1 ( Z 

p(Z\u,b) = — f- + l 

the moments of Z are given by 

6*r(i/ - 1 - t)r(t + 1) 



T(u-l) 



which allows one to pick hyperparameters that represent prior beliefs about the mean and 
variance of variables of interest, e.g. \/3j\ in the case of prior (flj) or (X)f=i \ fij\ q ) l ^ q i n the case 
of prior @. 

Focusing on the hierarchical adaptive lasso prior, we note that in this case we have E[|/3j|] = 
bj/{a,j — 1), for a,j > 1. An observation on a,j and bj is that when one increases both values but 
keeps E[|/3j|] constant, the tendency for the iterative scheme to set f3j to zero is reduced since Wj 
is upper-bounded by (dj + l)/bj. This observation could be used in a 'tempered' optimization 



scheme as discussed in Section 2.5 noting in particular that as ou- — > oo, the prior approaches 



a Laplace distribution and so the posterior approaches unimodality. 



2.5 Issues with MAP estimation 



There are many criticisms of MAP estimates in a Bayesian framework. We motivate use of such 
estimates for primarily computational reasons, since Bayesian variable selection methods tend 
to be prohibitively expensive when dealing with large data sets. Beyond the obvious problem of 
summarizing the posterior distribution over models with a point estimate, one problem is that 
MAP estimates are not Bayes estimators but instead a limit of Bayes estimators under the 0-1 
loss function. While important, this issue is not addressed here. A perhaps more fundamental 
issue is that MAP estimates are not invariant under reparametrization. This issue can be 
rectified by finding the point that maximizes posterior density with the Jeffreys measure as the 
dominating measure [16^ I17j. eg. for a likelihood f(x\9) and prior p(9), the parametrization- 
invariant MAP is given by 

Omap = argmax/(x|0)p(0)|/(^)r 1 / 2 

9 

where 1(0) is the Fisher information associated with f(x\9). 
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(a) EP Hierarchy (b) Grouped Hierarchy 



Figure 3: Graphical model representations of the exponential power and the grouped variable 
hierarchies 

An important issue with the MAP estimates obtained from our methodology is that the posterior 
is multimodal and there is no guarantee that one will obtain the global mode of the posterior 
as opposed to a local one. However, this is true of almost all non-convex penalized optimization 
approaches. In [6], a suggestion is to start with high values of bj and reduce the values of bj once 
the algorithm has converged. In principle, this can be done with both a* and bj, noting that such 
an algorithm will still find a local mode of the posterior and this 'tempering' of the posterior 
during optimization can affect which mode is chosen. We do not investigate this further but 
note that characterization of the modes obtained using such a process is an interesting open 
question. 



3 Related approaches 

The proposed approach, either in the hierarchical model or in the estimation step, is closely 
related to many approaches that have been suggested in the literature. One contribution of 
this paper is therefore to provide a Bayesian interpretation of existing methods and a flexible 
framework with which we can incorporate different models. 

3.1 Laplacian scale mixture distributions and compressible priors 

It has come to our attention that the HAL prior has been proposed independently in both 
[8] and [9]. In the former, one obtains the same procedure from a majorization-minimization 
algorithm and in the latter from an EM algorithm. However, our derivation makes explicit the 
flexibility of the hierarchy and generalizes this prior to exponential power families, situations 
with grouped variables and positive-definite matrices, making particularly clear strategies for 
choosing hyperparameters. 
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3.2 Weakly informative priors and non-convex penalization 

With (3j ~ N(0,(Tj) and a 1 - ~ IG(aj,bj) one obtains marginally a t-distribution for (3j\aj,bj. 
This corresponds to the idea of using weakly informative priors as in [18J. For the case where 
f3j ~ Laplace(0, r) and r ~ IG(aj, bj), ie. the hierarchical adaptive lasso, we can similarly think 
of the generalized t-distribution prior on f3j after marginalizing out r as a weakly informative 
prior. In fact, one can think of all of the priors proposed using the hierarchical approach in this 
work as weakly informative. 



3.3 Adaptive methods 



Within the statistical literature, the closest approach is perhaps the adaptive lasso [3], whose 
implementation corresponds to a single step of the exponential power family generalization of 
the HAL with a root-ra consistent estimator of /3 and bj — > 0. As such, the adaptive lasso 
estimator can be thought of as taking an initial estimate and returning an estimate with higher 
posterior density given a logarithmic prior. Our method, on the other hand, finds a local mode 
of the posterior. 

Similarly, the benefit of a polynomial form for the prior density is related to the motivation for 
the smoothly clipped absolute deviation penalty [2]. Indeed, the penalization induced by the 
HAL prior grows slowly so that large values of f3j are not unnecessarily biased while remaining 
continuous and sparse. The methods used to find local linear and local quadratic approxima- 
tions (LLA and LQA) algorithms of [21 H] are also closely related, being iteratively reweighted 
optimization algorithms with a different penalization. 



3.4 Iteratively reweighted ^-minimization 



The basic HAL algorithm is clearly similar to the reweighted-^i approach proposed in [5], which 
is identical except that the weights have the form 



(*) 



A 



3 e + l/fl 

which corresponds to a limiting case where Oj — > A — 1 and bj is set to be small. 

Similarly, the exponential-family generalization of the HAL algorithm is related to the family 
of approaches suggested in |6j for the various ^-penalization norms. As such, the hierarchical 
model for /3 gives an interpretation to the methods in the family of iteratively reweighted 
optimization solutions and, in particular, to the selection of additional parameters e and A. 



3.5 Normal-Exponential-Gamma priors 



Our hierarchical prior differs from that suggested in [TU] in that an inverse gamma prior is 
placed on rj as opposed to Tj. This difference in their work results in a posterior for (3 for 
which it is difficult to obtain MAP estimates [20] , although this problem can be alleviated by 
novel fast methods for computation of the parabolic cylinder function [21J. The marginal prior 
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is a member of the generalized hyperbolic family. This difference also appears in |22j . although 
in that work the full posterior is explored using MCMC. 



3.6 A note on improper priors 

Consider the exponential power density 

^ ,,) = 2^(1 + 1/,)"* ri-) 

with the scale- invariant prior on rjj, p(rjj) oc The prior on f3j after marginalizing out rjj is 

then, regardless of q, improper with the form p(/3j\q) oc l/|/3j|. Since this is the same prior for 
q = 1, which we know will produce sparse (3 and for q = 2, which is the prior proposed in [23], 
this explains why the prior in |23j produces sparse results. However, it is worth noting that the 
posterior for (3 using this prior is improper with unbounded density at (3 = 0. 



3.7 Graphical Model 



Figure [3] gives graphical models for the hierarchies corresponding to the exponential power (EP) 
generalization of section 2.3.1 and the adaptive group lasso of section 2.3.2 These models allow 



us to visualize the flexibility of the framework and the connections with related approaches. 
Indeed, for q = 1 one obtains the hierarchical adaptive lasso or, by setting rjj to be a fixed 
hyperparameter, the standard lasso. Similarly, for q = 2 one obtains hierarchical adaptive ridge 
regression or standard ridge regression. For the hierarchy with grouped variables, the simi- 
larity to the hierarchical adaptive lasso hierarchy is clear, suggesting that application-specific 
hierarchies could be developed that lead to iteratively reweighted methods. 



4 Examples 

4.1 Linear regression 



In linear regression, the likelihood of y given X and (3 is given by 

p(y|X, /3, /x, <5 2 ) = — L-^ exp {~(y„ " X/3) T (y M - X0) } 

where y^ == y — /ul n . If X is standardized, we have l^X = and so we can put an improper 
prior on fi with p(fi) oc 1 and integrate it out so that 

p(y|X, (3, 5 2 ) = -W— exp {-^(y - X/3) T (y - X/3)l (3) 

(2ird 2 ) 2 yjn I zo ) 

where y = \ Ya=i 2/i and y = y - yl n . 
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4.1.1 Fixed 5 2 



If 5 2 is fixed, we proceed as expected. Note that in this case, the Jeffreys prior for /3 is a uniform 
improper prior so no adjustment needs to be made to make the MAP estimate invariant. 

To test the method, we simulated data using f3 = (3, 1.5, 0, 0, 2, 0, 0, 0) T , 5 2 = 1 and X ~ N(0, S) 
with = 0.5^1. We then ran 1000 repetitions of the hierarchical adaptive lasso and the 
standard lasso on this problem with various settings of (a, b) and r respectively with the results 
given in Tables T][2 ons of the hierarchical adaptive lasso and the standard lasso on this problem 



with various settings of (a, b) and r respectively with the results given in Tables T|2 
Table 1: Results for the LASSO (linear regression, 5 = 1 



n 


T 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


0.2 


0.4137 


9.7 


1.808 


0.0 


40 


0.1 


0.4817 


36.7 


0.89 


0.0 


40 


0.02 


1.6732 


90.0 


0.089 


0.015 


80 


0.2 


0.2872 


2.8 


2.519 


0.0 


80 


0.1 


0.2931 


20.0 


1.3510 


0.0 


80 


0.02 


0.8169 


92.7 


0.079 


0.0 



Table 2: Results for the HAL (linear regression, 5 = 1) 



n 


(a, 6) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


(1,0.1) 


0.3118 


89.9 


0.105 


0.0 


40 


(2,0.1) 


0.3044 


98.1 


0.019 


0.0 


40 


(2,0.05) 


0.3026 


99.6 


0.004 


0.0 


80 


(1,0.1) 


0.2191 


81.3 


0.079 


0.0 


80 


(2,0.1) 


0.2061 


96.6 


0.034 


0.0 


80 


(2,0.05) 


0.2038 


98.8 


0.012 


0.0 



Both methods are capable of giving good results in this setting, which has a high signal-to-noise 
ratio. However, the reduction in average error is evident for the HAL, owing mainly to less 
penalization of the selected coefficients. We ran the same experiment but with 5 = 3, to test 
the algorithm with a lower signal-to-noise ratio with the results given in Tables 3]|4 Again, the 



results for the HAL are typically superior to that for the LASSO. However, incorporating prior 
information leading to less penalization of $2 and improves performance drastically. This 
type of prior information is likely to be necessary when we wish to include variables whose true 
coefficients are small. 

Table 3: Results for the LASSO (linear regression, 5 = 3) 



n 


T 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


1/6 


1.9747 


53.5 


0.378 


0.255 


40 


0.125 


2.3952 


43.0 


0.207 


0.580 


40 


0.125* 


2.2117 


93.9 


0.005 


0.057 



* (t 2 ,t 5 ) = (0.25,0.25) 
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Table 4: Results for the HAL (linear regression, 5 = 3) 



n 


(a,b) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


(2,0.75) 


1.3407 


56.2 


0.274 


0.352 


40 


(2,0.1) 


1.7198 


28.0 


0.064 


0.831 


40 


(2,0.1)* 


1.0224 


95.9 


0.004 


0.038 



* (a 2 ,b 2 ,a 5 ,b 5 ) = (2,2,2,2) 



4.1.2 5 2 ~ IG{a 5 ,b 5 ) 

If we model 5 2 ~ IG(a$, bs), we can find that MAP estimate associated with the posterior density 
p(P\y,X), ie. with S 2 integrated out. To do so, we additionally include 5 2 as a latent variable 
in the EM algorithm, noting that conditional on /3, 6 2 and r are independent. Furthermore, we 
have 5 2 \P, X, y ~ IG(a s + (n - l)/2, b s + l/2(y - X/3) T (y - X/3)) For the hierarchical adaptive 
lasso, we iteratively solve 

p(t+i) = _„(*) ( y _ X /3) T (y - X/3) - ^tof |/3,-| 

i=i 

where 

,,(*) = as + (n- l)/2 (t ) aj + 1 

^ 6, + l/2(y - X/3W)^(y - X/3W) ^ 6 . + |^)| 

To test the method, we simulated data using the same as before but letting 5 2 ~ IG(as, bs). We 
then ran 1000 repetitions of the hierarchical adaptive lasso with various settings of (as, bs, a, b) 
with the results given in Table [5] There is clearly more difficulty in estimating the coeffi- 
cients accurately when the variance of the observations is higher and there is again increased 
performance with good prior information. 

Table 5: Results for the HAL (linear regression, random 5 2 ) 



n 


(as,b s ) 


(a,b) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


(3,5) 


(2,0.1) 


0.5509 


90.5 


0.040 


0.070 


40 


(1,1) 


(2,0.1) 


0.7302 


79.8 


0.058 


0.265 


40 


(1,4) 


(2,0.2) 


1.5046 


53.0 


0.085 


0.742 


40 


(1,4) 


(2,0.2)* 


1.1865 


78.0 


0.047 


0.318 



* (a 2 ,b 2 ,a 5 ,b 5 ) = (2,2,2,2) 



4.1.3 Grouped Variable Selection 

For grouped variable selection, we use p = 32 with groups of size 4. We let /3i:4 = (3, 1.5, 2, 0.5)', 
/?9:i2 = (6, 3, 4, 1)', /3i7 : 2o = (1.5, 0.75, 1, 0.25)' with all other components set to 0. The groupings 
of variables were given by G t = {4i + k : k E {1,2, 3, 4}}. As with ungrouped variable selection, 
the hierarchical adaptive version of the group lasso gives lower average errors and has a higher 
percentage of correct models chosen compared to the standard group lasso. 
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Table 6: Results for the GLASSO (linear regression, 5 = 3) 



n 


T 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


1/12 


3.4738 


65.4 


0.580 


1.012 


40 


0.1 


3.1407 


70.5 


0.948 


0.432 



Table 7: Results for the GHAL (linear regression, 5 = 3) 



n 


(a,b) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


(2,0.75) 


2.1267 


89.6 


0.328 


0.144 


40 


(2,0.7) 


2.1205 


91.1 


0.232 


0.176 



4.2 Logistic regression 

In logistic regression with 6 { — 1, 1}, one has p(y|X,/3) = nr=i(l + ex P( — yi(3 T *-i))~ l so the 
log-likelihood is 

n 

logp(y|X, f3) = - £ log(l + exp(-y J /3 T x i )) (4) 
i=l 

The Jeffreys prior for this likelihood is given byp(/3) oc IX'VX] 1 / 2 , where U is a diagonal matrix 
with 

exp(-(3 T Xj) 
Vhi ~ [l + exp(-/3^)]2 

As a result, the parametrization-invariant MAP estimate requires us to minimize 
Pmap = arg min - log /(X|y, 0) + \ log \X'VX\ - logp(0) 

Unfortunately, while — \ log |X'UX| is convex, \ log |X'VX| is not so the resulting minimization 
problem is not convex. However, this does not seem to be a a serious issue in our examples 
as the term log |X'VX| is relatively constant in the regions of high posterior density and so 
including the MAP correction has little effect on the results. 

To test the method, we simulated data using (3 = (3, 1.5, 0, 0, 2, 0, 0, 0) T and X ~ JV(0, E) with 
Sjj = 0.5'* - - 7 ' as with the linear regression simulations. We then ran 1000 repetitions of the 
hierarchical adaptive lasso and the standard lasso on this problem with various settings of (a, b) 
and r respectively with the results given in Tables [8]|9j An interesting result with this example 
is that for (a, b) = (2, 0.1) except for (02, &2j ^5) ^5) = (2, 2, 2, 2), the HAL gave poor results due 
to the correlation of the predictors and the relatively high penalization of In this case, /3\ 
was excluded from the model associated with the MAP estimate in every simulation. Using 
additionally (a\,bi) = (2,0.5) led to a drastic improvement in the results, highlighting the 
importance the prior hyperparameters can have. 

4.3 Gaussian graphical models 

The log-likelihood for this model (after standardization) is 

Ti 71 

io gP (x\n) = -io g |J2|- -tr(srn) 



14 



Table 8: Results for the LASSO (logistic regression) 



n 


T 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


80 


1/7.5 


2.8559 


62.1 


0.387 


0.098 


80 


0.1* 


2.7212 


93.9 


0.008 


0.053 



*(t 2 ,t 5 ) = (1,1) 

Table 9: Results for the HAL (logistic regression) 



n 


(o,6) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


80 


(2,0.65) 


1.3736 


65.4 


0.33 


0.114 


80 


(2,0.1)* 


3.2084 


0.0 


0.00 


1.000 


80 


(2,0.1)t 


1.1228 


99.2 


0.00 


0.008 



* (02,62,05,65) = (2,2,2,2) 

t (01,61,02,62,05,65) = (2,0.5,2,2,2,2) 



where S = l/ra^ i=1 x\ 

Jeffreys prior for this likelihood is given by p(£l) oc |r2|( p+1 )/ 2 so we can find the parametrization- 
invariant MAP using 



n — p — 1 



Qit+V _ ar p- m ax 

& n 2 



p p 



log|^|--tr(^)-^^4' ) m 

i=l j=i 



where 



w 



(*) 



<Hj + 1 



In order for the likelihood with the MAP correction term to be concave, we require n > p + 1 
since — log det is a convex function. 

To test the method, we simulated data using 
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and again used 1000 repetitions of the procedure using the LASSO and the HAL with the results 



given in Tables 10 11 The HAL clearly has superior performance when using hyperparameters 
such that the average number of false positives and false negatives are roughly equal. 



Table 10: Results for the LASSO (GGM) 



n 


T 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


1/45 


4.676 


23.9 


2.789 


1.887 


40 


1/50 


4.22 


22.3 


2.081 


2.139 
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Table 11: Results for the HAL (GGM) 



n 


(a, 6) 


avg. error 


% correct 


avg. false positives 


avg. false negatives 


40 


(1,0.075) 


2.594 


65.4 


1.304 


1.290 


40 


(2,0.1) 


2.850 


57.7 


1.343 


1.507 



5 Discussion 

We have proposed a MAP-based variable selection method using a hierarchical prior for (3 that 
works reasonably well in practice and brings together a variety of related approaches in the lit- 
erature. In particular, the estimate itself corresponds to the solution of a non-convex penalized 
optimization problem, with properties similar to that in [2] , ie. estimates of large coefficients 
tend to be penalized less than in standard ^i-penalized optimization approaches, while still 
being sparse and continuous in the data. A possibly more important contribution is the in- 
terpretation the method gives for various methods that have been proposed without Bayesian 
interpretations, in particular for adaptive or one-step methods in the statistics literature and 
for iteratively reweighted methods in the machine learning and signal processing literatures. 
This interpretation allows for manipulation of the hierarchy in application-specific ways. 

A number of open questions remain when using this class of methodology for variable selection. 
One is how to resolve the issue of multimodality of the posterior due to the non-concavity of the 
log of the prior density. Another is assessing the utility of point estimates when there is little 
guarantee that the model corresponding to the MAP estimate has significant posterior mass 
from a Bayesian variable selection perspective. In this work, we feel these issues are secondary 
as the major contribution is in the Bayesian interpretation and generalization of increasingly 
popular penalized optimization methods amongst practitioners. 
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